Sir:



This Appellants' Amended Brief is filed in response to a Final Office Action 
dated July 15, 2005, an Advisory Action dated September 20, 2005, a Notice of 
Appeal received October 17, 2005, and a Notification of Non-Compliant Appeal 
Brief dated April 5, 2006. There is no evidence submitted and relied upon by the 
appellant in the appeal so item number 8 on the PTOL-462 form entitled 
Notification Of Non-Compliant Appeal Brief (37 CFR 41.37) is believed to be 
compliant. Reconsideration of the Application, withdrawal of the rejections, and 
allowance of the claims are respectfully requested. 

I. REAL PARTY IN INTEREST

The real party in interest is International Business Machines (IBM) of
Armonk, NY.

II. RELATED APPEALS AND INTERFERENCES 
There are no related appeals or interferences. 

III. STATUS OF CLAIMS 
Claims 1-20 are pending. 
Claims 1 through 20 are rejected. 

The Appellants are appealing the rejection of independent claims 1, 14,
and 20 (all other remaining claims depend from these claims). Claims 1, 14, and
20 are on appeal.

IV. STATUS OF AMENDMENTS 

The Examiner issued a final rejection of claims 1-20 in the Final Office 
Action of July 15, 2005. Appellants submitted a response without amendment to 
this Final Office Action to overcome the Examiner's rejections. The Advisory 
Action dated September 20, 2005 addressed the Appellants' remarks and 
indicated that the response without amendment was entered. 

V. SUMMARY OF THE CLAIMED SUBJECT MATTER 

This summary references line numbers of the specification as filed. It is to
be noted that the text of each page of the filed specification starts with line
number 5.

The pending independent claims under appeal in this case are 
corresponding method, computer readable media, and apparatus claims. An 
advantageous application of the claimed subject matter that provides a clear, 
concise understanding of the method and apparatus of the present claimed 
invention is described in the specification at page 5, lines 18-29 to page 6, lines
1-8. Independent claim 20 is used herein to describe the subject matter defined
by the independent claims under appeal. Independent apparatus claim 20 sets
forth the following subject matter.

A) A retrieval unit for retrieving a web document at an address. FIG. 3a:
reference 308a; FIG. 3b: reference 302b; FIG. 5: references 504 and 506; and
Specification at page 10, lines 14-15 and page 12, lines 1-3.

B) for extracting contents of the web document. FIG. 3a: reference 310a; and
Specification at page 12, lines 11-13.

C) for rendering an intermediate dynamically constructed in-memory webpage
representation of the web document at a hub processing unit. FIG. 3a: reference
310a; FIG. 6 generally; and Specification at page 10, lines 17-19 and at page 12,
lines 8-29 to page 13, line 1.

D) which is formatted as if displayed for viewing on an end-user's web browser.
See Id.

E) A loader for loading secondary documents, as required, associated with the
web document in order to render the secondary documents as part of the
in-memory webpage representation. FIG. 3b: reference 304b; FIG. 6: references
606-614; and Specification at page 10, lines 15-17 and page 13, lines 2-5.

F) wherein the secondary documents include one or more images with textual
content embedded therein. See Id.

G) wherein the hub processing unit renders the in-memory webpage prior to
analyzing and summarizing the in-memory webpage. FIG. 3b: references 306b,
308b, 310b; and Specification page 12, lines 8-29 to page 13, lines 1-8 and page
14, lines 1-16.

H) A summarizer for analyzing and summarizing the in-memory webpage
representation. FIG. 3a: reference 316a; FIG. 3b: reference 310b; FIG. 7 and
FIG. 8 generally; and Specification at page 10, lines 19-22 and page 14, lines
2-16.

I) to produce a text map for the webpage document of the textual contents
therein. FIG. 7 generally; and Specification at page 13, lines 18-29.

J) An optical character recognition engine for use on the images to extract textual
content for adding to the textual map for the webpage document. FIG. 7:
reference 712; and Specification at page 13, lines 25-28.

Independent claims 1 and 14 are the method and computer readable medium
equivalents to the independent apparatus claim 20. Accordingly, support for
independent claims 1 and 14 is found in the specification as detailed above with
respect to independent claim 20.

VI. GROUNDS OF REJECTION TO BE REVIEWED ON APPEAL 
Whether claims 1, 14, and 20 are unpatentable under 35 U.S.C. § 103(a)
over Meyerzon et al. (U.S. Patent No. 6,638,314) in view of Lawrence et al.
(U.S. Patent No. 6,289,342) and in further view of Blumenthal (U.S. Patent No.
6,026,409).

VII. ARGUMENT 

A. WHETHER CLAIMS 1, 14, AND 20 ARE UNPATENTABLE OVER 
MEYERZON IN VIEW OF LAWRENCE AND IN FURTHER VIEW OF 
BLUMENTHAL 

In the Examiner's Office Action of July 15, 2005, the Examiner rejected
claims 1-3, 14-16, and 20 under 35 U.S.C. § 103(a) as being unpatentable over
Meyerzon et al. (U.S. Patent No. 6,638,314) (Hereinafter Meyerzon), in view of
Lawrence et al. (U.S. Patent No. 6,289,342) (Hereinafter Lawrence) and in
further view of Blumenthal (U.S. Patent No. 6,026,409) (Hereinafter Blumenthal).
The Appellants respectfully submit that claims 1-20 are patentable over 
Meyerzon and/or Lawrence and/or Blumenthal under 35 U.S.C. § 103(a). The 
Appellants assert that neither the Meyerzon, Lawrence, nor Blumenthal 
references, taken either alone or in combination with one another, teach or 
suggest the claimed limitations, particularly the claim limitation of "rendering an 
intermediate dynamically constructed in-memory webpage representation of the 
web document at a hub processing unit which is formatted as if displayed for 
viewing on an end-user's web browser." 

Appellants respectfully suggest selection of independent claim 1 as 
representative of the independent claims on appeal. Independent claim 1 is 
directed towards a method for browser-enhanced web crawling associated with a 
network of hub processing units coupled to a plurality of information processing 
units over a network, the method executed by a web crawler on a hub processing 
unit associated with the network comprising: 

retrieving a web document at an address, and extracting
contents of the web document for rendering an intermediate
dynamically constructed in-memory webpage representation of the
web document at a hub processing unit which is formatted as if
displayed for viewing on an end-user's web browser;

loading secondary documents associated with the web
document in order to render the secondary documents as part of
the in-memory webpage representation, wherein the secondary
documents include one or more images with textual content
embedded therein, wherein the hub processing unit renders the
in-memory webpage prior to analyzing and summarizing the
in-memory webpage;

analyzing and summarizing the in-memory webpage
representation to produce a text map for the webpage document of
the textual contents; and

using optical character recognition on the images to extract
textual content for adding to the textual map for the webpage
document.

The Appellants assert that, in particular, the underlined portions of the 
above claims are not taught or suggested by the Meyerzon and/or Lawrence 
and/or Blumenthal references, taken either alone or in combination with one 
another. 

The claims were rejected under 35 U.S.C. § 103(a). The Statute expressly
requires that obviousness or non-obviousness be determined for the claimed
subject matter "as a whole," and the key to proper determination of the
differences between the prior art and the present invention is giving full
recognition to the invention "as a whole." As discussed below, the Appellants
assert that these limitations, especially when considered in the context of the
other limitations of claim 1, are not described in the prior art references of record
and that these limitations render the claimed subject matter unobvious over the
prior art.

The present invention is advantageous over the prior art for many 
reasons. First, the present invention permits the fault-tolerant gathering of 
dynamic data documents on the World Wide Web. For example, the present 
invention is able to summarize web documents containing executable client side 
software code. Second, the present invention allows for the interpretation and 
summarization of textual and other information contained within the body of a 
web-based image document. In one embodiment, the present invention 
implements optical character recognition with web crawling so that a web crawler 
is able to properly summarize images and image maps that contain textual or
other data.

The Enhanced Browser Based Crawler of the present invention enhances
existing document gathering and analysis by, for example, dramatically improving 
the quality of the extracted metadata. This is due to the fact that the 
summarization of a document is based on the whole and complete document as 
it was designed by the document's author; the static heterogeneous data as well 
as the problematic dynamic data is completely rendered and integrated into the 
metadata for subsequent indexing of all metadata by a web crawler. For 
example, a dynamic in-memory representation of the web page, as intended to 
be seen by an end user, is created to extract the most accurate and 
comprehensive data set possible. A standard web crawler is not able to 
compose this type of highly dynamic and distributed document that includes 
dynamic information such as client side script, applets, or their equivalents. 

Overview of Prior Art 

To begin, the Meyerzon reference is directed towards a web crawler
program that includes a gatherer process for gathering information pertaining to 
electronic documents. See Meyerzon at col. 8, lines 58-60. In the system of 
Meyerzon, worker threads process URLs and then pass each URL to a filter 
daemon. See Meyerzon at col. 9, lines 13-16. The filter daemon uses the URL 
to retrieve the electronic document at the address specified by the URL. See 
Meyerzon at col. 9, lines 16-20. After retrieving an electronic document, the filter 
daemon parses the electronic document and returns a list of text and properties. 
See Meyerzon at col. 9, lines 29-31. The worker thread then passes the list of 
properties and text to the indexing engine for creating an index which is used by 
the search engine in subsequent searches. See Meyerzon at col. 10, lines 13- 
16. A user may then examine the list of documents returned by the search 
engine, select a document, and then the web browser displays the selected 
document to the user. See Meyerzon at col. 8, lines 23-25 and 32-35. 

The Lawrence reference is directed to an autonomous citation indexing 
system that can be used as an assistant agent for automating and enhancing the 
task of finding publications in electronic form such as on the World Wide Web.
See Lawrence at the Abstract. A citation index is autonomously created from 
literature in electronic form by an autonomous citation index ("ACI"). See 
Lawrence at column 7, lines 50-52. If the literature is not in electronic form, 
optical character recognition ("OCR") can be used to convert the literature to 
electronic form. See Lawrence at column 7, lines 51-53. The ACI system can 
then autonomously locate new articles, extract citations, identify citations to the 
same article which occur in different formats, and identify the context of citations 
in the body of articles. See Lawrence at column 7, lines 53-58. 

The Blumenthal reference is directed towards a system and method for
the visual search and retrieval of digital information within a single document or
multiple documents. See Blumenthal at column 7, lines 9-11. A viewing window
has a first pane that provides a global view of digitally stored information and a 
second pane that provides a local view of the information. A user submits 
queries and the keywords entered are displayed on the user's screen in a 
particular document as being highlighted. See Blumenthal at column 7, lines 1- 
25. 

Cited References Fail to Describe All Limitations 

With regard to the first limitation of claim 1, the Appellants traverse the
Examiner's assertion that the Meyerzon reference discloses "extracting
contents of the web document for rendering an intermediate dynamically
constructed in-memory webpage representation of the web document at a hub
processing unit which is formatted as if displayed for viewing on an end-user's
web browser". The Examiner, in the Final Office Action, cites Meyerzon, column
7, lines 60-65 and column 8, lines 15-10. The cited portions of the Meyerzon
reference are limited to 1.) a web crawler retrieving electronic documents and
data associated with the documents, and 2.) a browser that locates and displays
documents to a user. The Examiner, in the Advisory Action, further cites
Meyerzon, column 2, lines 46-55; column 9, lines 29-59; column 11, lines 27-12
and 53; and column 14, lines 32-54. These cited portions of the Meyerzon
reference are limited to 1.) retrieving a copy of the document and parsing the
retrieved copy (column 2, lines 46-55); 2.) parsing the document to return a list of
text and properties, wherein the text and properties are obtained from tags within
the HTML document (column 9, lines 29-59); 3.) various types of web crawls, i.e.,
"first full crawl", "full crawl", and "incremental crawl" (column 11, lines 27-12 and
53); and 4.) computing a hash value for the retrieved document (column 14, lines
32-54).

The Appellants respectfully assert that the retrieving and parsing of
documents disclosed by Meyerzon is not "retrieving a web document at an
address, and extracting contents of the web document for rendering an
intermediate dynamically constructed in-memory webpage representation of the
web document at a hub processing unit which is formatted as if displayed for
viewing on an end-user's web browser" as is recited for independent claims 1,
14, and 20. Meyerzon explicitly states that text and properties are obtained from
tags within the HTML documents. See Meyerzon at column 9, lines 9-43.
Therefore, Meyerzon is working on HTML source code, as compared to an
"intermediate dynamically constructed in-memory webpage representation of the
web document at a hub processing unit which is formatted as if displayed for
viewing on an end-user's web browser", as recited for independent claims 1, 14,
and 20.

For example, the following diagram illustrates how Meyerzon renders the 
webpage at the client side only: 




[Diagram: Meyerzon, U.S. Patent No. 6,638,314. Labeled elements: Hub
Processing Unit; Extractor; Summarizer; Rendering; Client.]

In contrast, the following diagram illustrates how the presently claimed invention
renders an in-memory representation of the webpage:




[Diagram: Presently Claimed Invention. Labeled elements: Hub Processing
Unit; Rendering; Extractor; Summarizer; Client.]



As can be seen from the above diagrams, Meyerzon does not teach,
suggest, or anticipate rendering an intermediate dynamically constructed
in-memory webpage representation of the web document at a hub processing unit
which is formatted as if displayed for viewing on an end-user's web browser. As
stated above, Meyerzon teaches indexing electronic documents. The web
crawler program of Meyerzon retrieves electronic documents and associated
data. See Meyerzon at col. 7, lines 60-67. The information is passed to an
indexing engine which creates an index of the retrieved data. The index contains
reference information and pointers to corresponding electronic documents, for
example, keywords. See Meyerzon at col. 8, lines 1-16.

When a user requests a search, the search engine examines its index
and returns a list of documents to the browser of the user's computer. See
Meyerzon at col. 8, lines 26-35. Meyerzon is not teaching extracting contents of
the web document for rendering an intermediate dynamically constructed
in-memory webpage representation of the web document at a hub processing unit
which is formatted as if displayed for viewing on an end-user's web browser. In
fact, Meyerzon is teaching indexing information (i.e., extracting keywords) on
electronic documents, which is not the same as extracting contents for rendering
an intermediate dynamically constructed in-memory webpage representation of
the web document. Meyerzon especially does not teach rendering, by a hub
processing unit, an in-memory webpage as if displayed for viewing on an
end-user's web browser. Neither the Examiner's citations of the Meyerzon
reference nor the remainder of the reference contain any teaching of the claim
element above. Meyerzon therefore does not realize the advantage of the present
invention, namely that the summarization of a document is based on the whole
and complete document as it was designed by the document's author: the static
heterogeneous data, as well as the problematic dynamic data, is completely
rendered and integrated into the metadata for subsequent indexing of all
metadata by a web crawler.

With regard to the second and third limitations of claim 1, the above
arguments with respect to the "in-memory webpage representation of the web
document" are likewise applicable here and will not be repeated. As stated
above, nowhere does Meyerzon teach, anticipate, or suggest an "in-memory
webpage representation of the web document" and therefore cannot teach,
anticipate, or suggest "loading secondary documents associated with the web
document in order to render the secondary documents as part of the in-memory
webpage representation..." and/or "analyzing and summarizing the in-memory
webpage representation..."

The Appellants further assert that the Lawrence reference is completely
silent on "rendering an intermediate dynamically constructed in-memory
webpage representation of the web document at a hub processing unit which is
formatted as if displayed for viewing on an end-user's web browser". The
Lawrence reference also does not teach, suggest, or anticipate loading
secondary documents associated with the web document in order to render the
secondary documents as part of the in-memory webpage representation, wherein
the secondary documents include one or more images with textual content
embedded therein, wherein the hub processing unit renders the in-memory
webpage prior to analyzing and summarizing the in-memory webpage. Lawrence
therefore does not realize the advantage of the present invention, namely that
the summarization of a document is based on the whole and complete document
as it was designed by the document's author: the static heterogeneous data, as
well as the problematic dynamic data, is completely rendered and integrated into
the metadata for subsequent indexing of all metadata by a web crawler.

With regard to Blumenthal, the Appellants traverse the Examiner's
assertion that Blumenthal discloses "wherein the hub processing unit renders the
in-memory webpage prior to analyzing and summarizing the in-memory
webpage". The Examiner, in the Final Office Action, cites Blumenthal, column
17, lines 45-53. The cited portions of Blumenthal are limited to the additional
rendering of a cached document at the client computer. For example, the
following diagrams are provided to assist in describing the above technical
differences between Blumenthal and the present invention.

Starting with Blumenthal, the following diagram illustrates how additional
areas of a cached document are rendered at the client side only:



[Diagram: Blumenthal, U.S. Patent No. 6,026,409. Labeled elements: Document
(can be stored locally or on a network); Rendering of additional areas of
cached document; Client.]

In contrast, the following diagram illustrates how the presently claimed invention
renders a complete in-memory representation of the webpage at a hub
processing unit:



[Diagram: Presently Claimed Invention. Labeled elements: Web Document; Hub
Processing Unit; Gatherer; Rendering; Extractor; Summarizer; Client.]



As can be seen from the above diagrams, Blumenthal renders additional areas
of the cached document on the client side. The present invention, on the other
hand, renders the in-memory representation at a hub processing unit, as it would
be displayed on a user's web browser.

Furthermore, Blumenthal only teaches rendering additional areas of a
cached document and not a complete in-memory webpage. See Blumenthal at
col. 17, lines 45-52 and FIG. 13. The present invention, on the other hand,
renders an in-memory webpage as it would be displayed on a user's web
browser and not just areas of the webpage. Therefore, the present invention is
able to summarize the document based on the whole and complete document as
it was designed by the document's author; the static heterogeneous data, as well
as the problematic dynamic data, is completely rendered and integrated into the
metadata for subsequent indexing of all metadata by a web crawler. Lawrence
does not provide such an advantageous system.

Moreover, Blumenthal states at col. 17, lines 45-53 "...the cached
document can be rendered..." wherein the term "render" relates to the visual
display of a document and not the construction of an in-memory data structure,
as recited for the present invention. See Blumenthal at col. 17, lines 45-53. The
present invention, on the other hand, recites "renders the in-memory webpage"
wherein the term "renders" is not implying a visual display of a document, but
rather the construction of a data structure of the webpage in memory, which is
subsequently analyzed and summarized. This distinction is important. The
teachings of Blumenthal are directed to the visual display of a document on a
client system. This is not an intermediary representation of the complete web
page along with "the secondary documents" which are loaded "as part of the
in-memory representation." The visual representation as described by Blumenthal
is not subsequently indexed. Accordingly, the teachings of Blumenthal are
completely inoperable in this regard.

Furthermore, as the references do not teach a hub processing unit for
"retrieving a web document at an address, and extracting contents of the web
document for rendering an intermediate dynamically constructed in-memory
webpage representation of the web document at a hub processing unit which is
formatted as if displayed for viewing on an end-user's web browser... wherein
the hub processing unit renders the in-memory webpage prior to analyzing and
summarizing the in-memory webpage..." the Appellants respectfully assert that
the suggestion for these elements cannot come from the Applicant's own
specification. The Federal Circuit has repeatedly warned against using the
Applicant's disclosure as a blueprint to reconstruct the claimed invention out of
isolated teachings of the prior art. See MPEP §2143 and Grain Processing Corp.
v. American Maize-Products, 840 F.2d 902, 907, 5 USPQ2d 1788, 1792 (Fed. Cir.
1988) and In re Fritch, 972 F.2d 1260, 23 USPQ2d 1780, 1783-84 (Fed. Cir. 1992).
The references of Meyerzon and/or Lawrence and/or Blumenthal do not even 
suggest, teach, or mention these claim limitations. 

ARC9-2000-0046-US1 14 09/607,370 

Moreover, the Federal Circuit has consistently held that when a §103
rejection is based upon a modification of a reference that destroys the intent,
purpose, or function of the invention disclosed in the reference, such a proposed
modification is not proper and the prima facie case of obviousness cannot be
properly made. See In re Gordon, 733 F.2d 900, 221 USPQ 1125 (Fed. Cir.
1984). Here, the intent, purpose, and function of Meyerzon taken alone and/or in
view of Lawrence and/or in further view of Blumenthal is the indexing of
electronic documents for use by a search engine allowing a user to select a
document to be displayed by a client-side web browser. The rendering of a
webpage only occurs at the client side. Because Meyerzon does not render an
in-memory webpage as it would be displayed on a user's web browser or render
the in-memory webpage prior to analyzing and summarizing the in-memory
webpage, this combination as suggested by the Examiner destroys the intent and
purpose of Meyerzon. In contrast, the intent and purpose of the present
invention is to render an in-memory webpage representation of a web document
prior to analyzing and summarizing the in-memory webpage. Accordingly, the
combination of Meyerzon and Lawrence in further view of Blumenthal results in
an inoperable system, and the Examiner's case of "Prima Facie Obviousness"
should be withdrawn.

Additionally, the Federal Circuit stated in McGinley v. Franklin Sports, Inc.
(Fed. Cir. 2001) that if references taken in combination would produce a
"seemingly inoperative device," such references teach away from the
combination and thus cannot serve as predicates for a prima facie case of
obviousness. In re Sponnoble, 405 F.2d 578, 587, 160 USPQ 237, 244 (CCPA
1969) (references teach away from combination if combination produces
seemingly inoperative device); see also In re Gordon, 733 F.2d 900, 902, 221
USPQ 1125, 1127 (Fed. Cir. 1984) (inoperable modification teaches away).
Here, Meyerzon teaches rendering an electronic document for display on a web
browser at the client side. Therefore, the combination of Meyerzon with
Lawrence and/or in further view of Blumenthal to produce the presently claimed
invention where an in-memory webpage representation of a web document is
rendered prior to analyzing and summarizing the in-memory webpage would
produce an inoperative device. Accordingly, the combination of Meyerzon and
Lawrence in further view of Blumenthal is improper.

Accordingly, independent claims 1, 14, and 20 distinguish over Meyerzon 
and/or Lawrence and/or Blumenthal for at least the reasons stated above. 
Claims 2-13, and 15-19 depend from claims 1, 14, and 20 respectively. Since 
dependent claims contain all the limitations of the independent claims, claims 2- 
13, and 15-19 distinguish over Meyerzon and/or Lawrence and/or Blumenthal as 
well. Therefore, the rejection of claims 1-20 should be reversed. 

CONCLUSION 



For the reasons stated above, Appellants respectfully contend that each 
claim is patentable. Therefore, reversal of all rejections is courteously solicited. 

Respectfully submitted, 



Dated: May 5, 2006 By:_ 



Jon Gibbons 
Registration No. 37,333 
Attorney for Appellants 



Fleit, Kain, Gibbons, Gutman & Bongini 
One Boca Commerce Center, Suite 111
551 N.W. 77th Street
Boca Raton, FL 33487 
Tel. (561)989-9811 
Fax (561)989-9812 

VIII. CLAIMS APPENDIX

1. A method for browser-enhanced web crawling associated with a network 
of hub processing units coupled to a plurality of information processing units over 
a network, the method executed by a web crawler on a hub processing unit 
associated with the network comprising: 

retrieving a web document at an address, and extracting contents of the 
web document for rendering an intermediate dynamically constructed in-memory 
webpage representation of the web document at a hub processing unit which is 
formatted as if displayed for viewing on an end-user's web browser; 

loading secondary documents associated with the web document in order 
to render the secondary documents as part of the in-memory webpage 
representation, wherein the secondary documents include one or more images 
with textual content embedded therein, wherein the hub processing unit renders 
the in-memory webpage prior to analyzing and summarizing the in-memory 
webpage; 

analyzing and summarizing the in-memory webpage representation to 
produce a text map for the webpage document of the textual contents; and 

using optical character recognition on the images to extract textual content 
for adding to the textual map for the webpage document. 

2. The method as defined in claim 1, wherein the retrieving the web
document at an address further comprises retrieving a document at an address 
selected from the group of addresses consisting of a nodal address, a network 
address, a URL and equivalents. 

3. The method as defined in claim 1, wherein the one or more images with
textual content embedded therein include at least one of an in-line GIF image 
and an in-line JPEG image. 

4. The method as defined in claim 1, wherein the loading secondary 
documents further comprises the loading of secondary documents including one 
or more Java applets with textual content embedded therein. 

5. The method as defined in claim 1, wherein the loading secondary
documents further comprises the loading of secondary documents including web 
documents selected from the group of documents consisting of in-line frames, 
frames, and equivalents. 

6. The method as defined in claim 4, wherein the loading secondary 
documents further comprises the loading of secondary documents including one 
or more Java Script components with textual content embedded therein. 

7. The method as defined in claim 1, wherein the retrieving the web
document further comprises performing the following sub-steps of:

initializing a first list with seed values;

checking if there are any URLs to be processed and in response that any
URL exists to be processed then performing the following sub-steps of:

determining if a URL is in a second list; and in response that a URL
is not in the second list then performing the following sub-steps of:
inserting the URL into the first list;
scheduling the URL for crawling;
crawling the URL when scheduled to do so;
removing the URL from the first list after the scheduled crawling;
entering the URL into the second list; and
repeating the checking step until there are no more URLs to be
processed;

where if the determining step determines that the URL is in the second list then
repeating the checking step until there are no more URLs to be processed.
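
By way of illustration only, and without construing or limiting the claim
language, the two-list bookkeeping recited in claim 7 can be sketched as
follows. This is a simplified sketch under stated assumptions: the names
crawl_all, url_pool, visited_pool, and fetch_links are hypothetical and do not
appear in the specification, the fetch_links callback stands in for the actual
crawling of a URL, and the ordering of the individual sub-steps is not
intended to mirror the claim exactly.

```python
from collections import deque
from typing import Callable, Iterable, List, Set


def crawl_all(seeds: Iterable[str],
              fetch_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit each reachable URL once and return the URLs in crawl order.

    fetch_links is a hypothetical callback that "crawls" one URL and
    returns the URLs discovered on that page.
    """
    url_pool = deque(seeds)          # "first list": URLs waiting to be crawled
    visited_pool: Set[str] = set()   # "second list": URLs already crawled
    crawl_order: List[str] = []

    while url_pool:                      # checking step: any URLs left?
        url = url_pool.popleft()         # take the URL out of the first list
        if url in visited_pool:          # determining step: already crawled
            continue
        crawl_order.append(url)
        visited_pool.add(url)            # enter the URL into the second list
        for link in fetch_links(url):    # crawl the URL (assumed callback)
            if link not in visited_pool:
                url_pool.append(link)    # schedule newly found URLs
    return crawl_order
```

In this sketch a URL that is already in the visited pool is simply skipped,
which corresponds to repeating the checking step without crawling.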

8. The method as defined in claim 7, wherein the sub-step of initializing a 
first list with seed values further includes the list being a URL pool. 

9. The method as defined in claim 7, wherein the sub-step of determining if a 
URL is in a second list further includes the second list being a visited pool. 

10. The method as defined in claim 7, wherein the sub-step of crawling further
comprises the sub-steps of: 

issuing an HTTP command to a web server named in the URL; 
receiving contents of an HTML page as a result of the issued HTTP 
command; and 

passing on the contents of the HTML page to a Page Rendering 
subroutine. 

11. The method as defined in claim 10, further including the sub-steps 
performed by the Page Rendering subroutine comprising: 

receiving the contents of the HTML page in the Page Rendering 
subroutine; 

building an in-memory representation of a Layout for the HTML page and 
if more data is needed to properly form the representation, then performing the 
sub-steps of: 

requesting additional web-based information; 
gathering this additional web-based information; 
inserting any URLs associated with this additional web-based 
information into the second list and a URL cache; 
building a final amended representation; and 
forwarding the final amended representation to an Extraction subroutine; 

wherein, if no more data is needed to properly form the in-memory 
representation, then forwarding the in-memory representation to the Extraction 
subroutine. 

12. The method as defined in claim 11, further including the sub-steps 
performed by the Page Extraction subroutine comprising: 

accessing a set of memory structures of the Page Renderer; 

copying a text portion of the structures into a text map; 

inspecting any in-line GIF and JPEG image references in the memory 
structures; 

extracting alternate text attributes; 

adding the alternate text attributes to a text map; 

invoking an optical character recognition engine; 

analyzing any in-line GIF and JPEG images using the optical character 
recognition engine for text content; 

extracting text content from the GIF and JPEG images; 

adding text content from the images to the text map; and 

forwarding the text map to a Page Summarizer subroutine. 
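
For illustration only, the text-map assembly recited in claim 12 can be sketched as follows. This is a minimal sketch under stated assumptions, not the Appellants' implementation: the function name `build_text_map`, the dictionary keys `src` and `alt`, and the `ocr` callable are all hypothetical, and the text map is modeled as a simple ordered list of strings.

```python
from typing import Callable, Dict, Iterable, List


def build_text_map(text_nodes: Iterable[str],
                   images: Iterable[Dict[str, str]],
                   ocr: Callable[[str], str]) -> List[str]:
    """Collect rendered text, image alternate text, and OCR text.

    images is a hypothetical list of records with "src" and "alt" keys.
    ocr is a hypothetical stand-in for an optical character recognition
    engine: given an image reference, it returns any recognized text.
    """
    images = list(images)

    # Copy the text portion of the rendered structures into the text map.
    text_map = [text for text in text_nodes if text]

    # Extract alternate text attributes and add them to the text map.
    for image in images:
        alt_text = image.get("alt", "")
        if alt_text:
            text_map.append(alt_text)

    # Run OCR on in-line GIF and JPEG images and add any recognized text.
    for image in images:
        source = image.get("src", "")
        if source.lower().endswith((".gif", ".jpg", ".jpeg")):
            recognized = ocr(source)
            if recognized:
                text_map.append(recognized)

    return text_map
```

The resulting list would then be handed to a summarizing step such as the one recited in claim 13.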

13. The method as defined in claim 12, further including the sub-steps 
performed by the Page Summarizer subroutine comprising: 

receiving a text map from the Page Extractor subroutine; 

processing the text map in an application-specific manner; 

applying data extraction patterns to the text map; 

translating resultant data from the applying step; 

forwarding any URLs present in the text map to a manager subroutine; and 

forwarding any extracted data and metadata to application logic. 

14. A computer readable medium including programming instructions, the 
programming instructions including instructions for browser-enhanced web 
crawling associated with a network of hub processing units coupled to a plurality 
of information processing units over a network, the browser enhanced web 
crawling instructions on the computer readable medium comprising: 

retrieving instructions for retrieving a web document at an address, and 
extracting contents of the web document for rendering an intermediate 
dynamically constructed in-memory webpage representation of the web 
document at a hub processing unit which is formatted as if displayed for viewing 
on an end-user's web browser; 

loading instructions for loading secondary documents associated with the 
web document in order to render the secondary documents as part of the in- 
memory webpage representation, wherein the secondary documents include one 
or more images with textual content embedded therein, and wherein the hub 
processing unit renders the in-memory webpage prior to analyzing and 
summarizing the in-memory webpage; 

analyzing and summarizing instructions for analyzing and summarizing the 
in-memory webpage representation to produce a text map for the webpage 
document of the textual contents therein; and 

using optical character recognition on the images to extract textual content 
for adding to the textual map for the webpage document. 

15. The computer readable medium as defined in claim 14, wherein the 
retrieving instructions for retrieving a web document at an address further 
comprises retrieving instructions for retrieving a document at an address selected 
from the group of addresses consisting of a nodal address, a network address, a 
URL and equivalents. 

16. The computer readable medium as defined in claim 14, wherein the one or 
more images with textual content embedded therein include at least one of an in- 
line GIF image and an in-line JPEG image. 

17. The computer readable medium as defined in claim 14, wherein the 
loading secondary documents further comprises the loading of secondary 
documents including one or more Java applets with textual content embedded 
therein. 

18. The computer readable medium as defined in claim 14, wherein the 
loading instructions for loading secondary documents further comprises loading 
instructions for loading of secondary documents including web documents 
selected from the group of documents consisting of in-line frames, frames, and 
equivalents. 

19. The computer readable medium as defined in claim 17, wherein loading 
secondary documents further comprises the loading of secondary documents 
including one or more Java Script components with textual content embedded 
therein. 

20. A browser-enhanced web crawling unit associated with a network of a 
plurality of hub processing units coupled to a plurality of information processing 
units over a network, the browser enhanced web crawling unit on a hub 
processing unit comprising: 

a retrieval unit for retrieving a web document at an address, and extracting 
contents of the web document for rendering an intermediate dynamically 
constructed in-memory webpage representation of the web document at a hub 
processing unit which is formatted as if displayed for viewing on an end-user's 
web browser; 

a loader for loading secondary documents as required associated with the 
web document in order to render the secondary documents as part of the in- 
memory webpage representation, wherein the secondary documents include one 
or more images with textual content embedded therein, wherein the hub 
processing unit renders the in-memory webpage prior to analyzing and 
summarizing the in-memory webpage; 

a summarizer for analyzing and summarizing the in-memory webpage 
representation to produce a text map for the webpage document of the textual 
contents therein; and 

an optical character recognition engine for use on the images to extract 
textual content for adding to the textual map for the webpage document. 

IX. EVIDENCE APPENDIX 
NONE 

X. RELATED PROCEEDINGS APPENDIX 
NONE 